3 research outputs found
On developing an automatic threshold applied to feature selection ensembles
© 2019. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/. This version of the article "B. Seijo-Pardo, V. Bolón-Canedo, and A. Alonso-Betanzos, «On developing an automatic threshold applied to feature selection ensembles», Information Fusion, vol. 45, pp. 227-245, Jan. 2019" has been accepted for publication in Information Fusion. The Version of Record is available online at https://doi.org/10.1016/j.inffus.2018.02.007

[Abstract]: Feature selection ensemble methods are a recent approach that aims to add diversity to the sets of selected features, improving performance and yielding more robust and stable results. However, using an ensemble introduces the need for an aggregation step to combine the outputs of all the methods that make up the ensemble. Moreover, when computational efficiency is a concern, ranking methods that order all the initial features are preferred, so an additional thresholding step is also required. In this work, two different ensemble designs based on ranking methods are described. The main difference between them is the order in which the combination and thresholding steps are performed. In addition, a new automatic threshold based on the combination of three data complexity measures is proposed and compared with traditional thresholding approaches based on retaining a fixed percentage of features.
The behavior of these methods was tested, in terms of SVM classification accuracy, with satisfactory results, in three different scenarios: synthetic datasets and two types of real datasets (those where the sample size is much larger than the feature size, and those where the feature size is much larger than the sample size).

This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research project TIN 2015-65069-C2-1-R), by the Xunta de Galicia (research projects GRC2014/035 and the Centro Singular de Investigación de Galicia, accreditation 2016–2019) and by the European Union (FEDER/ERDF).
Los docentes que no han dejado de ser alumnos. Retos y experiencias en dos medios diferentes: online vs presencial [Teachers who have never stopped being students: challenges and experiences in two different settings, online vs face-to-face]
In this work, we describe our first teaching experience in two different settings: a face-to-face course in the Computer Science Degree at the Universidade da Coruña and an online course in the Research Master's Degree in Artificial Intelligence at the Menéndez Pelayo International University. The experience of teaching both courses simultaneously has allowed us to learn the differences between these two types of teaching. Our aim is to show how we overcame the challenges posed by these two courses, so that the reader may benefit from our brief but intense teaching adventures.
Fusión de Información e Ensembles na Aprendizaxe Automática
Programa Oficial de Doutoramento en Computación. 5009V01

[Abstract] Traditionally, machine learning methods have used a single learning model to solve
a particular problem. However, the idea of combining multiple models instead of a single one has its rationale in the old proverb "Two heads are better than one". This approach constructs a set of hypotheses using several different models, which are then combined in order to obtain better performance than learning a single hypothesis with a unique method. Several studies have shown that these combined models usually obtain better accuracy than individual methods, due to the diversity of the approaches and the control of the variance, taking advantage of the strengths of the individual methods while overcoming their weak points at the same time. These combinations of models are called "committees", or more recently "ensembles". Ensemble learning algorithms have reached great popularity in the machine learning literature, as they achieve performances that were not possible some years ago, and have thus become a "winning horse" in many applications.
Moreover, during the last years, the size of the datasets used in machine learning has grown considerably. Thus, dimensionality reduction has become a must in almost every case and, among such preprocessing methods, feature selection (FS) has become an essential preprocessing step for many data mining applications, eliminating irrelevant and redundant information and thus reducing storage requirements and the computational time needed by machine learning algorithms. Also, several studies have demonstrated that feature selection can greatly contribute to improving the performance of subsequent classification methods.
One of the main points addressed in this thesis is the application of the ensemble learning idea to the feature selection process, with the aim of introducing diversity and increasing the regularity of the process. Regularity is the ability of the ensemble approach to obtain acceptable results regardless of the dataset under study and its particular properties. It should also be mentioned that using ensemble approaches has the added benefit of releasing the user from the task of selecting the most adequate method for each dataset, and thus from the obligation of knowing the technical details of the existing algorithms. In this way, more user-friendly FS methods are also coming onto the scene.
Ensembles for feature selection are a recent proposal, and not many works can be found in the literature. There are several steps that need to be addressed when creating an ensemble for FS:
1. Create a set of different feature selectors, each one providing its output. In
order to create diversity, there are several methods that can be used, such as
using different samples of the training dataset, using different feature selection
methods, or a combination of both.
2. Aggregate the results obtained by the individual models. There are several aggregation methods that can be used in this step, such as majority voting, weighted voting, etc. It is important to choose an adequate aggregation method, one that is able to preserve the diversity of the individual base models while maintaining accuracy.
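The two steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the thesis implementation: the two base rankers (an absolute-correlation scorer and a signal-to-noise scorer) and the mean-rank aggregation are illustrative choices standing in for any set of diverse selectors and any aggregation method.

```python
import numpy as np

def rank_by(scores):
    # Higher score = more relevant; return the rank position of each feature (0 = best).
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))
    return ranks

def corr_scores(X, y):
    # Absolute Pearson correlation of each feature with the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    den = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    return np.abs(Xc.T @ yc / den)

def snr_scores(X, y):
    # Signal-to-noise ratio: |mu1 - mu0| / (s1 + s0), per feature.
    X0, X1 = X[y == 0], X[y == 1]
    return np.abs(X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0) + 1e-12)

def ensemble_rank(X, y, scorers):
    # Step 1: each base selector produces a full ranking (diversity via different methods).
    all_ranks = np.array([rank_by(s(X, y)) for s in scorers])
    # Step 2: aggregate the individual rankings by mean rank.
    return all_ranks.mean(axis=0)

# Toy data: 10 features, only feature 0 is truly relevant to the class label.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 10))
X[:, 0] += 2 * y
agg = ensemble_rank(X, y, [corr_scores, snr_scores])
top = np.argsort(agg)[:3]   # the three best-ranked features
```

Note that both base rankers agree on the truly relevant feature here, so the aggregated ranking places it first; with noisier scorers, the mean rank is where the diversity of the ensemble pays off.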
In this thesis, we have designed several approaches for the first of the aforementioned steps: (i) a homogeneous approach, that is, using the same feature selection method with different training data and distributing the dataset over several nodes (or several partitions); and (ii) a heterogeneous approach, i.e., using different feature selection methods with the same training data. Regarding the second step above, we have also studied different methods for combining the results obtained from the individual methods. Besides, when the chosen individual selectors are rankers, at some point we needed to establish a threshold to retain only the relevant features and to combine the rankings obtained by the different methods that make up the ensemble. In this sense, we have analyzed two different proposals, depending on whether thresholding is performed before or after combination. Finally, a third novelty in this work is related to the need of establishing an adequate threshold, and thus we propose a methodology for establishing automatic thresholds based on measures of data complexity.

The adequacy of the methods proposed throughout this thesis was checked, so as to be able to extract a series of final conclusions. To this end, a variety of datasets of different types were used: synthetic, real "classical" (more samples than features) and real DNA microarray datasets (more features than samples). In a first step, synthetic datasets were used to perform the initial tests and check the performance of the newly implemented methods. In a second step, real datasets (both classical and microarray) were used to check the adequacy of the new methods on real-world problems, allowing us to carry out a performance comparison and also to extract a series of final conclusions.
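The difference between thresholding before and after combination can be seen on a toy example. This is a sketch under stated assumptions, not either of the thesis designs verbatim: the two hand-made rankings, the top-k threshold, the unanimity vote in design A and the mean-rank aggregation in design B are all illustrative choices.

```python
import numpy as np

# Two toy rankings over 6 features from two hypothetical rankers.
# Position i holds the rank of feature i (0 = most relevant).
r1 = np.array([0, 1, 2, 3, 4, 5])
r2 = np.array([4, 0, 1, 2, 3, 5])
k = 3  # retain the top-k features

# Design A: threshold each ranking first, then combine by (unanimous) voting.
votes = sum((r < k).astype(int) for r in (r1, r2))
selected_a = set(np.where(votes == 2)[0])        # features kept by every ranker

# Design B: combine the rankings first (mean rank), then threshold once.
mean_rank = (r1 + r2) / 2
selected_b = set(np.argsort(mean_rank)[:k])
```

The two orders need not select the same subset: here design A keeps only the features in every ranker's top-k, while design B rescues a feature ranked high by one ranker and moderately by the other.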
Finally, nowadays it is common to find missing data in real-world problems, which the proposed feature selection ensembles, like any other machine learning method, are likely to face. Traditionally, the common way to deal with this situation was to delete those samples that contained missing data, but this is not possible when the percentage of missing data is substantial, and thus imputation has become the usual approach. However, imputation before FS can lead to false positives: features that are not associated with the target become dependent as a result of imputation. In this exploratory work we use causal graphs to illustrate the notion of structural bias, and develop a modified t-statistic test to analyze the possible bias that can originate. Our conclusion is that it is more advisable to devise feature selection methods that are "robust" to the presence of missing data than to impute them. In this regard, the development of ensemble feature selection in this scenario remains as a future line of work.
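The structural-bias phenomenon can be reproduced with a small simulation. This is a toy sketch, not the thesis's causal-graph analysis or its modified t-statistic: the MNAR mechanism (values missing only in class 1 when they are positive) and plain mean imputation are illustrative assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 2000
y = np.repeat([0, 1], n)            # balanced binary target
x = rng.normal(size=2 * n)          # feature truly independent of y

# Illustrative MNAR mechanism: values go missing only in class 1
# when the value is positive.
missing = (y == 1) & (x > 0)

# Mean imputation using the mean of the observed values.
x_imp = x.copy()
x_imp[missing] = x[~missing].mean()

# Class-mean gap before imputation is just sampling noise;
# after imputation, a clear spurious dependence on y appears.
gap_before = abs(x[y == 1].mean() - x[y == 0].mean())
gap_after = abs(x_imp[y == 1].mean() - x_imp[y == 0].mean())
```

A plain two-sample t-test on the imputed feature would flag it as associated with the target, even though the original feature is pure noise, which is exactly the false-positive risk of imputing before feature selection.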